Two Step POS Selection for SVM Based Text Categorization

نویسندگان

  • Takeshi Masuyama
  • Hiroshi Nakagawa
چکیده

Although many researchers have verified the superiority of Support Vector Machine (SVM) on text categorization tasks, some recent papers have reported much lower performance of SVM based text categorization methods when focusing on all types of parts of speech (POS) as input words and treating large numbers of training documents. This was caused by the overfitting problem that SVM sometimes selected unsuitable support vectors for each category in the training set. To avoid the overfitting problem, we propose a two step text categorization method with a variable cascaded feature selection (VCFS) using SVM. VCFS method selects a pair of the best number of words and the best POS combination for each category at each step of the cascade. We made use of the difference of words with the highest mutual information for each category on each POS combination. Through the experiments, we confirmed the validation of VCFS method compared with other SVM based text categorization methods, since our results showed that the macro-averaged F1 measure (64.8%) of VCFS method was significantly better than any reported F1 measures, though the micro-averaged F1 measure (85.4%) of VCFS method was similar to them. key words: Text Categorization, Text Classification, Support Vector Machine (SVM), Parts of Speech (POS), Variable Cascaded Feature Selection (VCFS)

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Feature Selection in SVM Text Categorization

This paper investigates the effect of prior feature selection in Support Vector Machine (SVM) text categorization. The input space was gradually increased by using mutual information (MI) filtering and part-of-speech (POS) filtering, which determine the portion of words that are appropriate for learning from the information-theoretic and the linguistic perspectives, respectively. We tested the ...

متن کامل

Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA

With the explosive growth in amount of information, it is highly required to utilize tools and methods in order to search, filter and manage resources. One of the major problems in text classification relates to the high dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of features space. There are many feature selection methods. However...

متن کامل

The Role of Parts-of-Speech in Feature Selection

This research explores the role of parts-of-speech (POS) in feature selection in text categorization. We compare the use of different POS, namely nouns, verbs, adjectives and adverbs with a feature set that contains all POS. The best results are obtained with the use of only nouns. Therefore, we make use of a WordNet-based POS feature selection approach using the nouns feature set to compare wi...

متن کامل

Natural Language Processing based Soft Computing Techniques

This paper presents the implementation of soft computing (SC) techniques in the field of natural language processing. An attempt is made to design and implement an automatic tagger that extract a free text and then tag it. The part of speech taggers (POS) is the process of categorization words based on their meaning, functions and types (noun, verb, adjective, etc). Two stages tagging system ba...

متن کامل

MMR-based Feature Selection for Text Categorization

We introduce a new method of feature selection for text categorization. Our MMR-based feature selection method strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization. Empirical results show that MMR-based feature selection is more effective than Koller & Sahami’s method, which is one of greedy feature selection ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • IEICE Transactions

دوره 87-D  شماره 

صفحات  -

تاریخ انتشار 2004